Practical Considerations for Dimensionality Reduction
Abstract
In this thesis, we explore practical extensions to dimensionality reduction (DR) algorithms that address challenging, real-world cases. The first case we consider is how to guide users who employ DR methods in their data analysis. We specifically target users who are not experts in the mathematical concepts behind DR algorithms. We first identify two levels of guidance: global and local. Global user guidance helps non-experts select and arrange a sequence of analysis algorithms. Local user guidance helps users choose appropriate algorithm parameters and interpret algorithm output. We then present a software system, DimStiller, that incorporates both types of guidance, and validate it on several use cases. The second case we consider is using DR to analyze datasets consisting of documents. To modify DR algorithms to handle document datasets effectively, we first analyze the geometric structure of document datasets. Our analysis describes the ways document datasets differ from other kinds of datasets. We then leverage these geometric properties for speed and quality by incorporating ideas from text querying into DR and other algorithms for data analysis. We then present the Overview prototype, a proof-of-concept document analysis system. Overview combines two goals: designing systems for data analysts who are DR novices, and performing DR on document data. The third case we consider is that of costly distance functions, where the method used to derive the true proximity between two data points is computationally expensive. Using standard approaches to DR in this important use case can result in either unnecessarily protracted runtimes or long periods of user monitor-
Similar Resources
Lecture: Some Practical Considerations (4 of 4)
Today, we will continue talking about the connection between non-linear dimensionality reduction methods and diffusions and random walks. We will also describe several methods for constructing graphs that are “semi-supervised,” in the sense that there are labels associated with some of the data points, and we will use these labels in the process of constructing the graph. These will give r...
2D Dimensionality Reduction Methods without Loss
In this paper, several two-dimensional extensions of principal component analysis (PCA) and linear discriminant analysis (LDA) techniques have been applied in a lossless dimensionality reduction framework for a face recognition application. In this framework, the benefits of dimensionality reduction were used to improve the performance of its predictive model, which was a support vector machine (...
A Monte Carlo-Based Search Strategy for Dimensionality Reduction in Performance Tuning Parameters
Redundant and irrelevant features in high-dimensional data increase the complexity of the underlying mathematical models. It is necessary to conduct pre-processing steps that search for the most relevant features in order to reduce the dimensionality of the data. This study made use of a meta-heuristic search approach that uses lightweight random simulations to balance between the exploitation of ...
Lecture: Some Practical Considerations (1 of 4)
Today, we will shift gears. So far, we have gone over the theory of graph partitioning, including spectral (and non-spectral) methods, focusing on why and when they work. Now, we will describe a little about how and where these methods are used. In particular, for the next few classes, we will talk somewhat informally about some practical issues, e.g., how spectral clustering is done in practic...
Performing a Preprocessing Step before Feature Extraction in the Classification of Hyperspectral Image Data
Hyperspectral data potentially contain more information than multispectral data because of their higher spectral resolution. However, the stochastic data analysis approaches that have been successfully applied to multispectral data are not as effective for hyperspectral data. Various investigations indicate that the key problem that causes poor performance in the stochastic approaches t...